Analysis of Boston data

Explore the structure and the dimensions of the data and describe the dataset briefly. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)

'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
[1] 506  14
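
The structure and dimension output above is consistent with calls such as the following. This is a minimal sketch, assuming the data come from the `MASS` package; the choice of variables in the scatterplot matrix is only an illustration, not something the report specifies.

```r
library(MASS)     # assumption: Boston is loaded from the MASS package
data("Boston")

str(Boston)       # variable names, types and first values
dim(Boston)       # number of rows and columns
summary(Boston)   # five-number summary and mean of each variable

# Graphical overview of a few variables (illustrative subset)
pairs(Boston[, c("crim", "nox", "rm", "lstat", "medv")])
```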


Standardization

Standardization subtracts each variable's mean and divides by its standard deviation. After scaling, every variable has a mean of zero, so the values are distributed around zero.
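
A minimal sketch of this step, assuming base R's `scale()` was used on the `Boston` data from `MASS` (the object name `boston_scaled` is an assumption):

```r
library(MASS)
data("Boston")

# scale() computes (x - mean(x)) / sd(x) for every column
boston_scaled <- scale(Boston)

round(colMeans(boston_scaled), 10)   # column means are (numerically) zero
apply(boston_scaled, 2, sd)          # column standard deviations are one

class(boston_scaled)                 # scale() returns a matrix, not a data frame
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
```

Converting the result back to a data frame keeps later steps such as the formula interface of `lda()` convenient.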

      crim                 zn               indus        
 Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
 1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
 Median :-0.390280   Median :-0.48724   Median :-0.2109  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
 Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
      chas              nox                rm               age         
 Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
 1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
 Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
 Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
      dis               rad               tax             ptratio       
 Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
 1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
 Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
 Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
     black             lstat              medv        
 Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
 1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
 Median : 0.3808   Median :-0.1811   Median :-0.1449  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
 Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865  
[1] "matrix"
crime
     low  med_low med_high     high 
     127      126      126      127 
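
The four nearly equal class sizes above are what one would expect from cutting the scaled crime rate at its quartiles. A sketch under that assumption (the report does not show how `crime` was built):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Quartiles of the scaled crime rate serve as break points
bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# Replace the numeric crime rate with the categorical version
boston_scaled$crim <- NULL
boston_scaled$crime <- crime
```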
Call:
lda(crime ~ ., data = train)

Prior probabilities of groups:
      low   med_low  med_high      high 
0.2623762 0.2376238 0.2549505 0.2450495 

Group means:
                  zn      indus         chas        nox         rm
low       0.98130174 -0.9066431 -0.123759247 -0.8840549  0.4727977
med_low  -0.09732623 -0.2797183  0.014751158 -0.5526505 -0.1612113
med_high -0.37817407  0.2234619  0.224586496  0.4283399  0.1858054
high     -0.48724019  1.0171737  0.006051757  1.0425648 -0.4943251
                age        dis        rad        tax     ptratio
low      -0.8799711  0.8776391 -0.6925875 -0.7290337 -0.48886393
med_low  -0.2726170  0.3687760 -0.5583740 -0.4744113 -0.04393502
med_high  0.4031911 -0.4143320 -0.3853374 -0.2851164 -0.36468085
high      0.8013977 -0.8371692  1.6375616  1.5136504  0.78011702
              black       lstat        medv
low       0.3808634 -0.78888000  0.56746998
med_low   0.3145534 -0.08442091 -0.03086276
med_high  0.0692395 -0.07036499  0.24862295
high     -0.7205699  0.90561295 -0.66275564

Coefficients of linear discriminants:
                 LD1          LD2        LD3
zn       0.090717939  0.641865637 -0.8647445
indus   -0.009184624 -0.249613915  0.1627299
chas    -0.077574894 -0.031438525  0.1405313
nox      0.384221239 -0.732333796 -1.2477158
rm      -0.101302748 -0.096585547 -0.1739006
age      0.281926343 -0.306344858  0.0215151
dis     -0.099435704 -0.148680544  0.2346623
rad      2.987569110  0.849616989 -0.2434196
tax      0.011829136  0.073820136  0.5454135
ptratio  0.110676590  0.045970115 -0.1260826
black   -0.118508965  0.006123112  0.1066025
lstat    0.201672706 -0.078432106  0.5001612
medv     0.181386179 -0.275042300 -0.1260297

Proportion of trace:
   LD1    LD2    LD3 
0.9424 0.0422 0.0154 
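
In the output above, LD1 accounts for about 94% of the between-group variance, and `rad` has by far the largest LD1 coefficient (about 2.99). A fit of this kind could be produced roughly as follows. This is a sketch, not the report's exact code: the seed and the 80/20 split are assumptions (the split is consistent with the 404 training rows reported later), so the numbers will differ from those shown.

```r
library(MASS)
data("Boston")

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)                         # assumption: any seed works
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]         # 80% of the rows for fitting
test  <- boston_scaled[-ind, ]        # remaining rows for evaluation

# Linear discriminant analysis with crime class as the target
lda.fit <- lda(crime ~ ., data = train)
lda.fit
```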

          predicted
correct    low med_low med_high high
  low       11      10        0    0
  med_low   10      17        3    0
  med_high   0      10       13    0
  high       0       0        0   28
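
The cross-tabulation above (presumably built with something like `table(correct = ..., predicted = ...)` on the test set) can be summarized numerically. The sketch below re-enters the counts from the table by hand and computes the overall and per-class accuracy.

```r
classes <- c("low", "med_low", "med_high", "high")

# Counts copied from the table above: rows = correct, columns = predicted
conf <- matrix(c(11, 10,  0,  0,
                 10, 17,  3,  0,
                  0, 10, 13,  0,
                  0,  0,  0, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = classes, predicted = classes))

accuracy  <- sum(diag(conf)) / sum(conf)   # share of correct predictions
per_class <- diag(conf) / rowSums(conf)    # share correct within each true class

round(accuracy, 3)
round(per_class, 3)
```

Of the 102 test observations, 69 are classified correctly (about 68%). Every `high` observation is predicted correctly, and all errors fall in a neighbouring class.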
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.119  85.624 170.539 226.315 371.950 626.047 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   2.016  149.145  279.505  342.899  509.707 1198.265 
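
The two summaries above look like pairwise distance summaries, most likely Euclidean first and Manhattan second, since Manhattan distances are never smaller. Their magnitudes (maxima of roughly 626 and 1198) suggest they were computed on the unscaled variables; that is an inference, not something the report states. A sketch that computes both versions for comparison:

```r
library(MASS)
data("Boston")

# Distances on the raw variables
dist_eu  <- dist(Boston)                         # Euclidean (default)
dist_man <- dist(Boston, method = "manhattan")   # Manhattan
summary(dist_eu)
summary(dist_man)

# Distances on the standardized variables, for comparison
scaled_eu <- dist(scale(Boston))
summary(scaled_eu)
```

On raw data, variables with large ranges such as `tax` and `black` dominate the distances, which is why standardizing before k-means matters.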

Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters?
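
One possible way to approach the bonus task is sketched below. The number of clusters (3), the seed, and the arrow scaling factor are all assumptions chosen for illustration, and the biplot is drawn with base graphics rather than any particular helper function.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)                                   # assumption: any seed works
km <- kmeans(boston_scaled, centers = 3, nstart = 20)
boston_scaled$cluster <- factor(km$cluster)

# LDA with the k-means clusters as target classes, all 14 variables as predictors
lda.km <- lda(cluster ~ ., data = boston_scaled)

# Biplot: observations in LD space plus arrows for the variable coefficients
scores <- predict(lda.km)$x
plot(scores[, 1], scores[, 2], col = as.integer(boston_scaled$cluster),
     pch = as.integer(boston_scaled$cluster), xlab = "LD1", ylab = "LD2")
arrow_scale <- 2                                # hypothetical scaling for visibility
arrows(0, 0, arrow_scale * lda.km$scaling[, 1],
       arrow_scale * lda.km$scaling[, 2], length = 0.1)
text(arrow_scale * lda.km$scaling[, 1:2],
     labels = rownames(lda.km$scaling), pos = 3, cex = 0.8)

# Variables ranked by the size of their LD1 coefficient
sort(abs(lda.km$scaling[, "LD1"]), decreasing = TRUE)
```

Variables with the longest arrows, equivalently the largest absolute coefficients, contribute most to separating the clusters along the discriminants.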

Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

Adjust the code: add color as an argument to the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?

[1] 404  13
[1] 13  3
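
The two dimensions above (404 x 13 and 13 x 3) are the shapes of the predictor matrix and of the LDA coefficient matrix, so their product is a 404 x 3 matrix of projected points. A sketch of that computation follows; the seed, the split, and the four k-means clusters are assumptions, and the `plotly` calls are guarded because that package may not be installed.

```r
library(MASS)
data("Boston")

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)                                   # assumption
ind <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Predictors only (404 x 13) times LDA coefficients (13 x 3)
model_predictors <- train[, setdiff(names(train), "crime")]
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
dim(matrix_product)

# k-means clusters on the same training predictors (4 clusters is an assumption)
km_train <- kmeans(model_predictors, centers = 4, nstart = 20)

if (requireNamespace("plotly", quietly = TRUE)) {
  # Colored by crime class; print p_crime to display it
  p_crime <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                             z = matrix_product$LD3, type = "scatter3d",
                             mode = "markers", color = train$crime)
  # Colored by k-means cluster; print p_km to display it
  p_km <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                          z = matrix_product$LD3, type = "scatter3d",
                          mode = "markers", color = factor(km_train$cluster))
}
```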